House dust metagenome and pulmonary function in a US farming population

Background: Chronic exposure to microorganisms inside homes can impact respiratory health. Few studies have used advanced sequencing methods to examine adult respiratory outcomes, especially continuous measures. We aimed to identify metagenomic profiles in house dust related to the quantitative traits of pulmonary function and airway inflammation in adults. Microbial communities comprising 1264 species (389 genera) in vacuumed bedroom dust from 779 homes in a US cohort were characterized by whole metagenome shotgun sequencing. We examined two overall microbial diversity measures: richness (the number of individual microbial species) and the Shannon index (reflecting both richness and relative abundance). To identify specific differentially abundant genera, we applied the Lasso estimator with high-dimensional inference methods, a novel framework for analyzing microbiome data in relation to continuous traits after accounting for all taxa examined together.

Results: Pulmonary function measures (forced expiratory volume in one second (FEV1), forced vital capacity (FVC), and FEV1/FVC ratio) were not associated with overall dust microbial diversity. However, many individual microbial genera were differentially abundant (p-value < 0.05 controlling for all other microbial taxa examined) in relation to FEV1, FVC, or FEV1/FVC. Similarly, fractional exhaled nitric oxide (FeNO), a marker of airway inflammation, was unrelated to overall microbial diversity but associated with differential abundance for many individual genera. Several genera, including Limosilactobacillus, were associated with both a pulmonary function measure and FeNO, while others were associated with a single trait, including Moraxella with FEV1/FVC and Stenotrophomonas with FeNO.

Conclusions: Using state-of-the-art metagenomic sequencing, we identified specific microorganisms in indoor dust related to pulmonary function and airway inflammation. Some were previously associated with respiratory conditions; others were novel, suggesting specific environmental microbial components contribute to various respiratory outcomes. The methods used are applicable to studying the microbiome in relation to other continuous outcomes.

Supplementary Information: The online version contains supplementary material available at 10.1186/s40168-024-01823-y.
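The two overall diversity measures named in the abstract can be illustrated with a short Python sketch. It assumes the Shannon index is computed with the natural logarithm (the text does not state the base), and the taxon labels and counts are hypothetical values chosen only for illustration.

```python
import math


def richness(counts):
    """Number of species observed (count > 0) in one sample."""
    return sum(1 for c in counts.values() if c > 0)


def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln p_i) over observed species.

    The natural logarithm is an assumption; other bases only rescale H.
    """
    total = sum(counts.values())
    proportions = [c / total for c in counts.values() if c > 0]
    return -sum(p * math.log(p) for p in proportions)


# Hypothetical per-species read counts for two dust samples.
even_sample = {"sp_A": 25, "sp_B": 25, "sp_C": 25, "sp_D": 25}
skewed_sample = {"sp_A": 97, "sp_B": 1, "sp_C": 1, "sp_D": 1}

for name, sample in [("even", even_sample), ("skewed", skewed_sample)]:
    print(name, richness(sample), round(shannon_index(sample), 3))
```

Both hypothetical samples have the same richness, but the evenly distributed sample has the higher Shannon index, which is why the two measures are reported separately.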

Here $y_i$ represents the phenotype measurement for observation $i = 1, \dots, n$, and $x_i = (x_{i1}, \dots, x_{ip})^T$ contains all predictor variables, including the relative abundances of the microbial taxa. The association between the predictors and the phenotype is measured via the vector $\beta = (\beta_1, \dots, \beta_p)^T$, where $\beta_j$ measures the association between the $j$th predictor and the phenotype while controlling for all other predictors. Inference on associations is performed via the test

$$H_0: \beta_j = 0 \ \text{(no association)} \quad \text{versus} \quad H_a: \beta_j \neq 0. \tag{1}$$

This test is carried out for each predictor variable. Estimation of the regression coefficients via the Lasso estimator and computation of the p-values for the test (1) are described in Algorithm 1.

Algorithm 1 Inference on regression parameters with many predictors
1: Compute the Lasso estimate $\hat{\beta}$ of $\beta$.
2: Compute regularized estimates $\hat{\mu}_j$ by regressing the $j$th predictor $x_{ij}$ on the remaining predictors $x_{i,-j}$, [...] $x_{ij}$ and the refitted estimate of the target coefficient [...], where $\sigma_j^2$ can also be consistently estimated from the data.
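The general shape of Algorithm 1 can be sketched in Python. The explicit formulas for the refitted estimate and its variance are not given in the text above, so this sketch assumes one widely used construction, the debiased (desparsified) Lasso, in which the residuals of a regularized regression of predictor $j$ on the other predictors are used to correct the Lasso estimate. The tuning parameters `lam` and `lam_node`, the noise-variance estimator, and the simulated data are all hypothetical choices for illustration, and predictors are assumed to be centered.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def soft_threshold(a, t):
    return math.copysign(max(abs(a) - t, 0.0), a)


def lasso(cols, y, lam, sweeps=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1.

    cols holds the columns of X; returns (coefficients, residuals).
    """
    n = len(y)
    b = [0.0] * len(cols)
    resid = list(y)  # residual y - Xb for the current b
    scale = [dot(c, c) / n for c in cols]
    for _ in range(sweeps):
        for j, c in enumerate(cols):
            if scale[j] == 0.0:
                continue
            rho = dot(c, resid) / n + scale[j] * b[j]
            new = soft_threshold(rho, lam) / scale[j]
            if new != b[j]:
                step = new - b[j]
                resid = [r - step * ci for r, ci in zip(resid, c)]
                b[j] = new
    return b, resid


def debiased_lasso(cols, y, lam, lam_node):
    """Return (estimate, standard error, two-sided p-value) per predictor."""
    n = len(y)
    beta, resid = lasso(cols, y, lam)  # step 1: Lasso estimate
    n_active = sum(1 for b in beta if b != 0.0)
    # Assumed noise-variance estimator: residual sum of squares / (n - active set size).
    sigma2 = dot(resid, resid) / max(n - n_active, 1)
    results = []
    for j, xj in enumerate(cols):
        others = cols[:j] + cols[j + 1:]
        # Step 2: regularized regression of predictor j on the rest; z = residuals.
        _, z = lasso(others, xj, lam_node)
        zx = dot(z, xj)
        est = beta[j] + dot(z, resid) / zx  # bias-corrected coefficient
        se = math.sqrt(sigma2 * dot(z, z)) / abs(zx)
        p_value = math.erfc(abs(est / se) / math.sqrt(2.0))  # normal approximation
        results.append((est, se, p_value))
    return results


# Hypothetical simulated data: only the first predictor affects the outcome.
random.seed(0)
n, p = 80, 10
cols = [[random.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [2.0 * cols[0][i] + random.gauss(0, 1) for i in range(n)]
results = debiased_lasso(cols, y, lam=0.1, lam_node=0.1)
for j, (est, se, p_value) in enumerate(results):
    print(j, round(est, 3), round(se, 3), p_value)
```

Each p-value corresponds to the test (1) for one predictor while all other predictors remain in the model. In practice the penalty levels would be chosen by a data-driven rule such as cross-validation rather than fixed as they are here.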

Whole genome shotgun metagenomic sequencing and quality control steps
The Center for Microbiome Innovation at the University of California San Diego completed library preparation, multiplexing, and whole genome shotgun sequencing using standard protocols [1]. FastQC v0.11.5 [2] was used to assess read quality by Phred quality score, GC content, the presence of adapters, overrepresented k-mers, duplicated read rate, and PCR artifacts or contamination. Low-quality and adapter sequences were removed using Atropos [3]. Bases with a Phred score < 15 were trimmed, starting from the end of the read, and reads shorter than 100 bp were discarded (-q 15 --minimum-length 100). The internal sequencing standard PhiX 174 and human sequences were identified using bowtie2 [4], SAMtools [5], and BEDtools [6], and the aligned sequences and their mates were removed. The above procedures were performed in the Qiita pipeline [7] by the IGM Genomics Center at the University of California San Diego.
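The quality trimming and length filter described above can be illustrated with a minimal Python sketch. It assumes the running-sum rule documented for cutadapt, the tool from which Atropos is derived; whether Atropos applied exactly this rule in the pipeline is an assumption, and the function names and example read are hypothetical.

```python
def quality_trim_index(qualities, cutoff):
    """Index at which to cut the 3' end of a read.

    Working backward from the end, accumulate (cutoff - quality) and cut
    where that running sum is largest; stop once it turns negative.
    """
    running, best, stop = 0, 0, len(qualities)
    for i in reversed(range(len(qualities))):
        running += cutoff - qualities[i]
        if running < 0:
            break
        if running > best:
            best, stop = running, i
    return stop


def trim_read(seq, qualities, cutoff=15, min_length=100):
    """Trim the 3' end, then drop the read if it is shorter than min_length."""
    stop = quality_trim_index(qualities, cutoff)
    if stop < min_length:
        return None  # read discarded
    return seq[:stop], qualities[:stop]


# Hypothetical short read; thresholds lowered so the example fits on one line.
example_quals = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
print(trim_read("ACGTACGTAC", example_quals, cutoff=10, min_length=3))
```

With the defaults `cutoff=15` and `min_length=100`, the function mirrors the `-q 15 --minimum-length 100` settings reported in the text.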
All raw reads of both ends (3' or 5') passed the basic FastQC checks (per-base sequence quality and per-sequence quality scores), and no low-quality sequences remained. A bimodal shape was observed in the per-sequence GC content, reflecting the wide distribution of genome GC content across the multiple species present in metagenomic samples.
We then classified the resulting paired-end reads using Kraken2 v2.1.1 [8] with pre-compiled data comprising RefSeq genomes for bacteria, archaea, eukaryotes, fungi, viruses, and plasmids and NCBI taxonomy information, with a confidence score threshold of 0.05 (--confidence 0.05) to enhance the accuracy of taxonomic assignments [9]. We then ran the Kraken2 output through Bracken v2.5.0 [10] with default parameters (-r 100 -l S -t 10) to quantify abundance at the species level. We built the Bracken database with the default k-mer length of 35. Tables S2 and S3 summarize the overall statistics of read sequences and the proportion of each host genome contaminant across samples.
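The classification and abundance estimation settings above can be written out as command lines. The Python sketch below only assembles the argument lists and does not run the tools; the database directory and file names are hypothetical placeholders, and only the flags quoted in the text plus the standard input and output options of Kraken2 and Bracken are used.

```python
def kraken2_command(db, reads_1, reads_2, report, output, confidence=0.05):
    """Argument list for classifying one paired-end sample with Kraken2."""
    return [
        "kraken2", "--db", db, "--paired",
        "--confidence", str(confidence),
        "--report", report, "--output", output,
        reads_1, reads_2,
    ]


def bracken_command(db, report, output, read_length=100, level="S", threshold=10):
    """Argument list for species-level abundance estimation with Bracken."""
    return [
        "bracken", "-d", db, "-i", report, "-o", output,
        "-r", str(read_length), "-l", level, "-t", str(threshold),
    ]


# Hypothetical paths for one sample.
kraken_cmd = kraken2_command(
    "kraken_db", "sample_R1.fastq", "sample_R2.fastq",
    "sample.kreport", "sample.kraken",
)
bracken_cmd = bracken_command("kraken_db", "sample.kreport", "sample.bracken")
print(" ".join(kraken_cmd))
print(" ".join(bracken_cmd))
```

Each list could be passed to `subprocess.run` on a system where the tools and database are installed. In Bracken, `-t` is the minimum number of reads required for a taxon to be re-estimated, not a thread count.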
Due to low biomass in the dust samples, we separately performed and processed two sequencing runs. Metagenomics datasets from low biomass samples are particularly vulnerable to microbial contamination from the sample collection instrument, sequencing kit, and laboratory reagents. We incorporated 'blank' controls by sequencing sterile water without adding dust sample DNA extractions [11]. We used the decontam R package v1.10.0 [12] to identify contaminant DNA sequences not present in the sampled community for each run. We then removed contaminants identified in either run. Using this process, we filtered out 168 taxa (Table S4). After separately conducting preprocessing and filtering for each run, we generated pooled abundance data by summing the abundance data from both runs.
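The filtering and pooling logic described above can be sketched for a single sample in Python. This covers only the bookkeeping: the contaminant lists are taken as given (in the study they came from decontam, which is not reimplemented here), and the taxon labels and abundances are hypothetical.

```python
from collections import Counter


def pool_runs(run_a, run_b, contaminants_a, contaminants_b):
    """Drop taxa flagged as contaminants in either run, then sum abundances."""
    contaminants = set(contaminants_a) | set(contaminants_b)
    pooled = Counter()
    for run in (run_a, run_b):
        for taxon, abundance in run.items():
            if taxon not in contaminants:
                pooled[taxon] += abundance
    return dict(pooled)


# Hypothetical abundances for one sample in each sequencing run.
run_1 = {"taxon_A": 120, "taxon_B": 30, "taxon_C": 5}
run_2 = {"taxon_A": 80, "taxon_C": 7, "taxon_D": 12}

# Hypothetical contaminant calls per run.
flagged_1 = {"taxon_C"}
flagged_2 = {"taxon_D"}

pooled = pool_runs(run_1, run_2, flagged_1, flagged_2)
print(pooled)
```

A taxon flagged in only one run is removed from both, matching the rule that contaminants identified in either run were excluded before pooling.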